Introduction


Source:media.giphy


I know what you’re thinking: “Oh no, not another thing about coronavirus ugh.” But I’m happy to tell you that we’re looking at COVID’s RNA virus sibling, good ‘ol Orthomyxoviridae Influenza. I’ve done quite a few research papers in college about the flu, and while we all may be desensitized the wrath of the flu, or shall I say, “coronavirus’ shadow”, I find the common ailment facinating. I discovered the CDC has a whole archive of flu vaccination statistics from the past 10 years, and that’s where I discovered the ’vaccinations’ dataset, described below. While you think it may be easy to get the CDC to cough up case numbers per state in the year of the great swine flu (2009) this is not the case (puns intended). I am not proud of it, but the ‘flu_cases’ data set came from Wikipedia. Unfortunately, both the fact that the validity of these case numbers may be compromised, and the fact that the data is raw number of cases rather than a rate (divided by that state’s population during the flu season) has made the numbers, and therefore predictions somewhat precarious, I still predicted that states with the highest number of vaccinations would have the lowest number of cases, hospitalizations, and deaths. So please, peruse the wonderful stats I have discovered and displayed through the power of data science!


This data is about flu cases/hospitalizations/deaths per state and percent of people who recieved the flu vaccination 18+ years in 2009-2010 Swine Flu Epidemic.

setwd("/Users/emilyreed/Desktop")
flu_cases <- read.csv("Flu_cases.csv")
vaccinations <- read.csv("Vaccinations.csv")


Flu cases has 56 observations (1 per state, plus a few territories). “Cases”, “hospitalizations”, and “deaths” refers to the number of cases, hospital visits, and death due to influenza reported in that state for the 2009-2010 flu season. This data is NOT separated by age group.

# flu cases table turning cases and
# hospitalizations into numeric values for later
# calculations
flu_cases1 <- flu_cases %>% mutate_at("Hospitalizations", 
    as.numeric)
glimpse(flu_cases1)
## Rows: 56
## Columns: 4
## $ State            <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ cases            <dbl> 2453, 1563, 8726, 154, 10545, 1321, 5491, 381, 54, 3…
## $ Hospitalizations <dbl> NA, 18, NA, NA, NA, 578, 766, NA, NA, NA, 860, NA, 3…
## $ Deaths           <int> 19, 13, 152, 32, 657, 70, 35, 7, 1, 230, 81, 13, 23,…


Vaccinations is a dataset with 51 observations (one per state, plus District of Colubmbia). “Perc_vaccinated” refers to the percentage of each age group vaccinated in each state in the 2009-2010 flu season. There are three age groups: less than 6 months, 6 months to 17 years, and 18 years old and up.

# flu vaccinations table
glimpse(vaccinations)
## Rows: 51
## Columns: 4
## $ Names                           <chr> "Alabama", "Alaska", "Arizona", "Arka…
## $ perc_vaccinated.less6months     <dbl> 45.8, 45.1, 45.9, 55.2, 45.4, 49.7, 5…
## $ perc_vaccinated.6months_17years <dbl> 53.2, 51.3, 53.1, 66.0, 53.3, 55.8, 6…
## $ perc_vaccinated.18years_up      <dbl> 43.0, 42.9, 43.6, 49.4, 42.3, 47.8, 4…

Tidy Data


Vaccinations dataset needs to be ’tidy’ed!

vaccinations2 <- vaccinations %>% 
  pivot_longer(2:4, names_to="age_group",values_to="percentage_vaccinated") %>% #pivoting to condense all perc_vaccinated rows
  separate("age_group", into=c("percent", "age_group"), sep="\\.", convert=T ) %>% #separating perc_vaccinated.agegroup into age group and percent
  select(-percent) %>% rename("State"="Names") #getting rid of pointless row "percent" and renaming "name" to "state" for joining and readability purposes

glimpse(vaccinations2)
## Rows: 153
## Columns: 3
## $ State                 <chr> "Alabama", "Alabama", "Alabama", "Alaska", "Ala…
## $ age_group             <chr> "less6months", "6months_17years", "18years_up",…
## $ percentage_vaccinated <dbl> 45.8, 53.2, 43.0, 45.1, 51.3, 42.9, 45.9, 53.1,…
# Lets join the data!
cases_vaccinations <- inner_join(flu_cases1, vaccinations2, 
    by = "State", convert = T)
glimpse(cases_vaccinations)
## Rows: 150
## Columns: 6
## $ State                 <chr> "Alabama", "Alabama", "Alabama", "Alaska", "Ala…
## $ cases                 <dbl> 2453, 2453, 2453, 1563, 1563, 1563, 8726, 8726,…
## $ Hospitalizations      <dbl> NA, NA, NA, 18, 18, 18, NA, NA, NA, NA, NA, NA,…
## $ Deaths                <int> 19, 19, 19, 13, 13, 13, 152, 152, 152, 32, 32, …
## $ age_group             <chr> "less6months", "6months_17years", "18years_up",…
## $ percentage_vaccinated <dbl> 45.8, 53.2, 43.0, 45.1, 51.3, 42.9, 45.9, 53.1,…


I performed an inner join, which allowed me to take only the State variables found both in flu cases and vaccinations data sets. I used an inner join because I only wanted to look at flu cases and vaccinations in the 50 formal United States, as well as avoid extra observations with a bunch of missing values. Using ‘inner_join’, I dropped the ‘Washington DC’ observation from the vaccinations dataset, and all the U.S territories (i.e. American Samoa, Guam, Puerto Rico,Northen Mariana Islands, U.S Virgin Islands ) from the flu_cases data set. The final merged data set, “cases_vaccinations” contains 6 variables of State, cases, hospitalizations, deaths, age_group, percentage_vaccinated, and 150 observations total (1 for each state at a particular age group)


Data Statistics

Finding the average % vaccinated for each state across all age groups.

# Find mean % of vaccinations across all age groups
# for each state
cases_vaccinations <- cases_vaccinations %>% group_by(State) %>% 
    mutate(average_vaccinations = mean(percentage_vaccinated))
cases_vaccinations
## # A tibble: 150 x 7
## # Groups:   State [50]
##    State cases Hospitalizations Deaths age_group percentage_vacc…
##    <chr> <dbl>            <dbl>  <int> <chr>                <dbl>
##  1 Alab…  2453               NA     19 less6mon…             45.8
##  2 Alab…  2453               NA     19 6months_…             53.2
##  3 Alab…  2453               NA     19 18years_…             43  
##  4 Alas…  1563               18     13 less6mon…             45.1
##  5 Alas…  1563               18     13 6months_…             51.3
##  6 Alas…  1563               18     13 18years_…             42.9
##  7 Ariz…  8726               NA    152 less6mon…             45.9
##  8 Ariz…  8726               NA    152 6months_…             53.1
##  9 Ariz…  8726               NA    152 18years_…             43.6
## 10 Arka…   154               NA     32 less6mon…             55.2
## # … with 140 more rows, and 1 more variable: average_vaccinations <dbl>
# find states with average vaccinations % that are
# above the National average vaccination %
national_average_vaccinated <- cases_vaccinations %>% 
    filter(!is.na(average_vaccinations)) %>% summarize(national_avg = mean(average_vaccinations)) %>% 
    summarize(mean(national_avg))
national_average_vaccinated
## # A tibble: 1 x 1
##   `mean(national_avg)`
##                  <dbl>
## 1                 51.6
cases_vaccinations %>% filter(average_vaccinations > 
    national_average_vaccinated) %>% select(State, 
    average_vaccinations) %>% summarize(avg_vaccination_percent = mean(average_vaccinations)) %>% 
    arrange(desc(avg_vaccination_percent))
## # A tibble: 21 x 2
##    State         avg_vaccination_percent
##    <chr>                           <dbl>
##  1 Rhode Island                     66.3
##  2 Hawaii                           64.5
##  3 Vermont                          62.3
##  4 Massachusetts                    61.6
##  5 South Dakota                     61.0
##  6 Maine                            60.6
##  7 New Hampshire                    58.9
##  8 Virginia                         58.2
##  9 Iowa                             57.9
## 10 Minnesota                        57.4
## # … with 11 more rows


Let’s see which states had the most cases of flu. Which states had the fewest percent average vaccinated citizens?

# Which states had the most cases of flu
cases_vaccinations %>% group_by(State) %>% summarize(case = mean(cases)) %>% 
    arrange(desc(case))
## # A tibble: 50 x 2
##    State         case
##    <chr>        <dbl>
##  1 Pennsylvania 10940
##  2 California   10545
##  3 Wisconsin     9579
##  4 Arizona       8726
##  5 Texas         6128
##  6 Nevada        5516
##  7 Connecticut   5491
##  8 Florida       3676
##  9 Illinois      3387
## 10 New York      2738
## # … with 40 more rows
# Which states had the lowest average % vaccinated
# across all age groups
cases_vaccinations %>% group_by(State) %>% summarize(avg_vaccination_percent = mean(average_vaccinations)) %>% 
    arrange(avg_vaccination_percent)
## # A tibble: 50 x 2
##    State       avg_vaccination_percent
##    <chr>                         <dbl>
##  1 Nevada                         40.2
##  2 Idaho                          43.7
##  3 Michigan                       43.7
##  4 Florida                        43.8
##  5 Georgia                        43.8
##  6 Mississippi                    44.3
##  7 Montana                        44.3
##  8 Oregon                         46.1
##  9 Alaska                         46.4
## 10 Missouri                       46.6
## # … with 40 more rows
# Do the states that had the most flu match with
# the states that had the lowest average %
# vaccinated? (Nope)
cases_vaccinations %>% group_by(State) %>% summarize(case_in_state = mean(cases), 
    avg_vaccination_state = (mean(average_vaccinations))) %>% 
    arrange(desc(case_in_state), avg_vaccination_state) %>% 
    select(State, case_in_state, avg_vaccination_state)
## # A tibble: 50 x 3
##    State        case_in_state avg_vaccination_state
##    <chr>                <dbl>                 <dbl>
##  1 Pennsylvania         10940                  51.8
##  2 California           10545                  47  
##  3 Wisconsin             9579                  48.9
##  4 Arizona               8726                  47.5
##  5 Texas                 6128                  47.9
##  6 Nevada                5516                  40.2
##  7 Connecticut           5491                  53.3
##  8 Florida               3676                  43.8
##  9 Illinois              3387                  48.5
## 10 New York              2738                  50.6
## # … with 40 more rows


Computing the mean and sd for national cases, hospitalizations, and deaths.

# Finding the national average of cases
national_average_cases <- cases_vaccinations %>% summarize(average_cases_US = mean(cases)) %>% 
    summarize(average_cases_US = mean(average_cases_US))
national_average_cases

# Finding the national sd of cases between states
national_sd_cases <- cases_vaccinations %>% summarize(sd_cases_US = mean(cases)) %>% 
    summarize(sd_cases = sd(sd_cases_US))
national_sd_cases

# Finding the national average of hospitalizations
national_average_hospitalizations <- cases_vaccinations %>% 
    filter(!is.na(Hospitalizations)) %>% summarize(average_hosp_US = mean(Hospitalizations)) %>% 
    summarize(average_hosp_US = mean(average_hosp_US))
national_average_hospitalizations

# Finding the national sd of hospitalizations
# between states
national_sd_hospitalizations <- cases_vaccinations %>% 
    filter(!is.na(Hospitalizations)) %>% summarize(sd_hosp_US = mean(Hospitalizations)) %>% 
    summarize(sd_hosp_US = sd(sd_hosp_US))
national_sd_hospitalizations

# Finding the national average of deaths (this data
# is obviously not very accurate, oop)
national_average_deaths <- cases_vaccinations %>% filter(!is.na(Deaths)) %>% 
    summarize(average_deaths_US = mean(Deaths)) %>% 
    summarize(average_deaths_US = mean(average_deaths_US))
national_average_deaths

# Finding the national sd of deaths between states

national_sd_deaths <- cases_vaccinations %>% filter(!is.na(Deaths)) %>% 
    summarize(sd_deaths_US = mean(Deaths)) %>% summarize(sd_deaths_US = sd(sd_deaths_US))
national_sd_deaths

cases_vaccinations %>% arrange(-Hospitalizations)
# make a table of all means/sd
table_mean_sd <- data.frame(national_average_cases, 
    national_sd_cases, national_average_hospitalizations, 
    national_sd_hospitalizations, national_average_deaths, 
    national_sd_deaths)
table_mean_sd %>% pivot_longer(1:6, names_to = "var", 
    values_to = "stat") %>% separate(var, into = c("stat_type", 
    "vars")) %>% pivot_wider(names_from = stat_type, 
    values_from = stat)
## # A tibble: 3 x 3
##   vars   average    sd
##   <chr>    <dbl> <dbl>
## 1 cases   2264.  2656.
## 2 hosp     407.   333.
## 3 deaths    66.6  100.


Finding the quantile, min, max for all numeric variables.

case_quantile <- cases_vaccinations$cases
cases_quantile <- quantile(case_quantile, c(0.25, 0.5, 
    0.75))

min_cases <- min(case_quantile)
max_cases <- max(case_quantile)

percent_vaccinated <- cases_vaccinations$percentage_vaccinated
vax_quantile <- quantile(percent_vaccinated, c(0.25, 
    0.5, 0.75), na.rm = T)
min_vax <- min(percent_vaccinated)
max_vax <- max(percent_vaccinated)

table_quantile <- data.frame(cases_quantile, min_cases, 
    max_cases, vax_quantile, min_vax, max_vax)

# showing table of quantile, min and max
table_quantile %>% as.data.frame %>% rownames_to_column("quantile") %>% 
    pivot_longer(c(cases_quantile, vax_quantile), names_to = "quantile_var", 
        values_to = "quantile_number") %>% separate(quantile_var, 
    into = c("quantile_var", "NA"), sep = "_", ) %>% 
    select(-"NA") %>% group_by(quantile_var) %>% pivot_longer(2:5, 
    names_to = "min_max", values_to = "min_max_values") %>% 
    separate(min_max, into = c("min_max", "type")) %>% 
    filter(quantile_var == type) %>% select(-type) %>% 
    pivot_wider(names_from = quantile, values_from = quantile_number) %>% 
    pivot_wider(names_from = min_max, values_from = min_max_values)
## # A tibble: 2 x 6
## # Groups:   quantile_var [2]
##   quantile_var `25%`  `50%`  `75%`   min     max
##   <chr>        <dbl>  <dbl>  <dbl> <dbl>   <dbl>
## 1 cases        722   1306.  2221    54   10940  
## 2 vax           45.9   49.8   55.2  38.1    82.1


The national average vaccination percentage was 51% for the 2009-2010 swine flu season. It would be interesting to see if the seasons that the national average was higher also had lower number of cases. The average and standard deviation for numeric variables “Cases”, “Hospitalizations”, and “Deaths” between each state, and are displayed in table format. The quantile cutoff values for cases and average percent vaccinated for all three age groups per state were calculated, as well as the minimum and maximum number of cases and vaccination percentage across all 50 states.


Graphing Time!!!!

Graph 1: Correlation heat map of numeric variables

# make unique ID
df <- unite(cases_vaccinations, State, age_group, col = "State_AgeGroup", 
    remove = F) %>% ungroup() %>% select(-State) %>% 
    select(-age_group)

corheatmat <- df %>% na.omit %>% select_if(is.numeric) %>% 
    cor(use = "pair")


tidycor <- corheatmat %>% as.data.frame %>% rownames_to_column("var1") %>% 
    pivot_longer(-1, names_to = "var2", values_to = "correlation")


tidycor %>% ggplot(aes(var1, var2, fill = correlation)) + 
    geom_tile() + scale_fill_gradient2(low = "red", 
    mid = "white", high = "blue") + geom_text(aes(label = round(correlation, 
    2)), color = "black", size = 4) + ggtitle("Correlation Heat Map of Numeric Variables") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    coord_fixed()


This correlation heat map shows what we would hopefully expect from this merged dataset. The “average_vaccinations” variable, which describes the average vaccination rate across all age groups for each state, is negatively correlated with deaths and hospitalizations, and very loosely coorelated (.02) with cases (probably by lack of data, or spurious correlation). We would want a negative correlation between these variables because it indicates that the more people that are vaccinated in that state, the fewer number of cases, hospitalizations, and deaths are recorded in that state. We also see a bunch of blue in the middle of the plot, which is also what we would expect. It is intuitive that the number of cases would naturally be postively correlated with an increase in hospitalizations and deaths.

Graph 2: Average vaccination % (for all age groups) and cases per state graph (scatterplot) for states with case numbers above national average

cases_vaccinations %>% group_by(State) %>% filter(!is.na(average_vaccinations)) %>% 
    mutate(avg_vaccination_percent = mean(average_vaccinations)) %>% 
    select(State, avg_vaccination_percent, cases, age_group) %>% 
    filter(cases > national_average_cases) %>% ggplot() + 
    geom_point(aes(avg_vaccination_percent, cases, 
        col = State)) + ggtitle("Average % Vaccinations and Cases for State Cases Above National Average") + 
    xlab("Average Vaccination Percent") + ylab("Number of Cases")


This scatterplot depicts the relationship between number of cases and average vaccination percentage of all age groups for all the states that are above the national average number of cases (2264.32 cases). There does not seem to be any correlation depicted in this graph between the two variables (we would expect a negative correlation). Since that number of cases is not a rate per person per state, but rather the raw numbers, it makes sense that Texas and California are above the national average number of cases just by shear population size. It is interesting, however, that Pennsylvania has a high average vaccination percentage (51.833%), and yet a high case number (10940 cases). While this outlier seems to contridict graph 1, the increase in cases could be explained by the cold weather experienced during flu season, driving people in Pennsylvania into close quarters to spread the disease easier. This could also explain the high number of cases in New York and Connecticut despite a relatively high average vaccination percent.

Graph 3: States vaccination percentage for each age group, facet by state, outline facet label color by quantile distinction of cases

# determining if states are in 1st, 2nd or 3rd, or
# 4th quantile for cases for bar graph below
quantile_lowest_cases <- quantile(case_quantile, 0.25)
quantile_highest_cases <- quantile(case_quantile, 0.75)
cases_vaccinations <- cases_vaccinations %>% filter(!is.na(age_group)) %>% 
    mutate(states_quantile = case_when(cases <= quantile_lowest_cases ~ 
        "low", cases > quantile_lowest_cases & cases <= 
        quantile_highest_cases ~ "med", cases > quantile_highest_cases ~ 
        "high"))
p1 = cases_vaccinations %>% group_by(State) %>% filter(!is.na(percentage_vaccinated)) %>% 
    ggplot(aes(x = age_group, y = percentage_vaccinated, 
        fill = age_group)) + geom_bar(stat = "summary") + 
    scale_fill_brewer(palette = "Dark2") + facet_wrap(states_quantile ~ 
    State) + geom_hline(yintercept = 51.62267, linetype = "longdash") + 
    ggtitle("% Vaccinated by Age Group by State") + 
    theme(plot.title = (element_text(size = 20, face = "bold", 
        hjust = 0.5)), axis.title.x = element_blank(), 
        axis.text.x = element_blank(), strip.text = element_text(size = 2)) + 
    ylab("Percent Vaccinated")

p2 = cases_vaccinations %>% group_by(State) %>% filter(!is.na(percentage_vaccinated)) %>% 
    ggplot(aes(x = age_group, y = percentage_vaccinated, 
        fill = age_group)) + facet_wrap(~State) + geom_rect(aes(fill = states_quantile), 
    xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) + 
    theme_minimal()

g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)

gtable_select <- function(x, ...) {
    matches <- c(...)
    x$layout <- x$layout[matches, , drop = FALSE]
    x$grobs <- x$grobs[matches]
    x
}

panels <- grepl(pattern = "panel", g2$layout$name)
strips <- grepl(pattern = "strip_t", g2$layout$name)
stript <- grepl(pattern = "strip-t", g2$layout$name)
g2$layout$t[panels] <- g2$layout$t[panels] - 1
g2$layout$b[panels] <- g2$layout$b[panels] - 1

new_strips <- gtable_select(g2, panels | strips | stript)
grid.newpage()
grid.draw(new_strips)

gtable_stack <- function(g1, g2) {
    g1$grobs <- c(g1$grobs, g2$grobs)
    g1$layout <- transform(g1$layout, z = z - max(z), 
        name = "g2")
    g1$layout <- rbind(g1$layout, g2$layout)
    g1
}

# BIG thanks to source code:
# https://stackoverflow.com/questions/19440069/ggplot2-facet-wrap-strip-color-based-on-variable-in-data-set?answertab=votes#tab-top
new_plot <- gtable_stack(g1, new_strips)
grid.newpage()
grid.draw(new_plot)


Hold onto your britches for this doozy. These faceted bar graphs are showing vaccination percentage for each recorded age group in each state, shown in the legend. The dashed line indicates the national average vaccinated percent (51.62267%). The facet labels are colored according to quantile distinction for number of cases in that state. A pink label indicates that the state is considered “high”, or above the 3rd quartile. The blue label indicates “medium”, or between the 1st and 3rd quantiles (between >25% and 75%). The green label indicates “low”, or in the 1st quartile for cases (25% and below).

Bars above the dashed line symbolize for that age group in that specific state, their vaccination percentage rate is above the overall national average percentage rate. It looks like the highest bar for percent vaccinations for the nation as a whole is the 6 months to 17 years age group. This is probably due to parental influence. Looking at the pink labeled states, it looks as though almost all these states’ purple “18 years+” group is less than the national average for vaccinations. It is interesting that Delaware, a green labeled group, looks as though their vaccination percentage across all groups is low, yet with a few number of cases. This may be due to Deleware’s small population, or maybe because the Delewarians are hermits that never leave the house ever, neither to get vaccinated or to catch the virus.


Clustering Statistics

# PAM creating a data frame to examine clusters
df2 <- unite(cases_vaccinations, State, age_group, 
    col = "State_AgeGroup", remove = F) %>% ungroup() %>% 
    mutate_if(is.character, as.factor) %>% select(-State, 
    -age_group, -states_quantile) %>% na.omit

# choosing clusters
sil_width <- vector()
for (i in 2:50) {
    pam_fit <- pam(df2, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:50, y = sil_width)) + 
    scale_x_continuous(name = "k", breaks = 1:50) + 
    theme(axis.text.x = element_text(size = 5)) + geom_point(aes(x = 8, 
    y = 0.63), color = "red")

# clustering using PAM
pam1 <- df2 %>% select(-1) %>% scale %>% pam(k = 8)
pam1
## Medoids:
##      ID      cases Hospitalizations     Deaths percentage_vaccinated
## [1,] 28 -0.3681047      -1.21289162 -0.5407067            -1.0827275
## [2,] 55 -0.3444917      -0.31923551  0.1087812            -0.3968249
## [3,]  7  3.5936327       1.09597415 -0.1823685            -0.2076104
## [4,] 10 -0.3235023       1.38267611  0.8478536            -1.1536829
## [5,] 61 -0.1468421       1.36132596 -0.4735183            -0.3140435
## [6,] 52  0.6113977       0.04676699 -0.4511221             0.7502881
## [7,] 31 -0.8324938      -1.11834097 -0.6302912             0.1826446
## [8,] 37  1.1859808       1.53212713  3.6473704            -0.5623876
##      average_vaccinations
## [1,]          -1.38751988
## [2,]          -0.50677678
## [3,]           0.02271758
## [4,]          -1.46615765
## [5,]           0.04368765
## [6,]           1.22849682
## [7,]           0.30581357
## [8,]          -0.39668389
## Clustering vector:
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 1 2 1 5 6 5 6 6 6 5 5 5 6 6 6 1 1 1 7 7 7 2 2 2 8 8
## [39] 8 2 2 2 7 7 7 2 2 2 7 6 7 6 6 6 2 2 2 7 6 7 5 5 5
## Objective function:
##     build      swap 
## 0.8664379 0.8638319 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
# looking at clusters
df2 %>% mutate(cluster = pam1$clustering) %>% arrange(cluster)
## # A tibble: 63 x 7
##    State_AgeGroup cases Hospitalizations Deaths percentage_vacc…
##    <fct>          <dbl>            <dbl>  <int>            <dbl>
##  1 Alaska_less6m…  1563               18     13             45.1
##  2 Alaska_6month…  1563               18     13             51.3
##  3 Alaska_18year…  1563               18     13             42.9
##  4 Idaho_less6mo…  1171              389     23             42.9
##  5 Idaho_18years…  1171              389     23             41.4
##  6 Montana_less6…   961                9     19             44  
##  7 Montana_6mont…   961                9     19             45.6
##  8 Montana_18yea…   961                9     19             43.4
##  9 Colorado_less…  1321              578     70             49.7
## 10 Colorado_6mon…  1321              578     70             55.8
## # … with 53 more rows, and 2 more variables: average_vaccinations <dbl>,
## #   cluster <int>
# looking at final mediods
df2 %>% slice(pam1$id.med)
## # A tibble: 8 x 6
##   State_AgeGroup cases Hospitalizations Deaths percentage_vacc… average_vaccina…
##   <fct>          <dbl>            <dbl>  <int>            <dbl>            <dbl>
## 1 Montana_less6…   961                9     19             44               44.3
## 2 Utah_less6mon…   988              302     48             49.8             49.9
## 3 Connecticut_l…  5491              766     35             51.4             53.3
## 4 Georgia_less6…  1012              860     81             43.4             43.8
## 5 West Virginia…  1214              853     22             50.5             53.4
## 6 South Dakota_…  2081              422     23             59.5             61.0
## 7 Nebraska_less…   430               40     15             54.7             55.1
## 8 New York_less…  2738              909    206             48.4             50.6

Its interesting that all the final mediods are in the vaccination age group of ‘less than 6 months.’


# graphing clusters
df3 <- df2 %>% select(-1) %>% mutate(cluster = as.factor(pam1$clustering))
df3 %>% ggpairs(columns = 1:6, aes(color = cluster), 
    upper = list(continuous = wrap("cor", size = 9))) + 
    theme_grey(base_size = 30)


I chose 8 clusters because it was a peak in the silhouette peak graph(shown by a red dot). The best number of clusters, 21, was too large to compute and would not have revealed anything significant about the clusters because there would be so little observations in each cluster. The next best cluster peak number was 14, but the graphs were too busy and confusing to interpret. The next best cluster number was 8, which is what was used to generate the graphs above.

Cluster 1 (red) is probably clustered most strongly by number of deaths, with a high peak of around 200. Cluster 2 (yellow) is tightly clustered at three peaks for average vaccinations, having the highest number of state_AgeGroups in its cluster. Cluster 5 and 6 aslo cluster tightly in average percent vaccinations, and weaker in deaths. Cluster 7 (purple) exhibits a cluster of state_AgeGroups with a high number of cases, hospitalization, and deaths. As expected from our correlation heat map, as well as from the correlations shown in the graph above, cluster 7 exhibits a low percentage vaccinated and a very wide distribution of average vaccinations. Overall, the clustering shown in the graph above does not reveal any striking cluster groups, but it overall reveals that despite differences in case numbers and vaccination % accross the county, there is not a clear group of states_ageGroups that are preventing the flu the best (with very low case numbers and high vaccination percent), but cluster 7 is the closest to earning this title.

So what can be concluded from this dataset? Well, there is a negative correlation between vaccinations and hospitalizations/deaths due to the flu in the flu season of 2009-2010. There is also a very weak positive correlation between cases and vaccination percentage, but that could be more clearly evaluated in a more reliable dataset about cases. It is likely that there would be a much higher case number if vaccinations percentage was lower, but could only be tested if the data set was more reliable. Moral of the story: always get your flu vaccine if you can, it doesn’t seem to hurt!


gif image source: media.giphy